EBA3500 Data analysis with programming
Jonas Moss
BI Norwegian Business School
Department of Data Science and Analytics
The function \(f(x) = c + bx + ax^2\) is called a quadratic function or second-degree polynomial. It takes on four kinds of shapes, depending on the values of \(a\), \(b\), and \(c\).
Let the true data generating model be \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon.\] This is a quadratic regression model. It can be regarded as a kind of non-linear regression model, but it usually isn't treated as one.
Define \(X_1 = X\) and \(X_2 = X^2\). Then \[\beta_0 + \beta_1X + \beta_2X^2 = \beta_0 + \beta_1X_1 + \beta_2X_2,\] which implies that the quadratic regression model is actually a multiple linear regression model!
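As a sketch of why this matters in practice (simulated data and made-up coefficients, not from the slides), the I() helper in a statsmodels formula squares the covariate, and ordinary least squares then fits the quadratic model directly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate from a quadratic model with made-up coefficients (1, 2, -0.5).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 1, 500)
data = pd.DataFrame({"x": x, "y": y})

# I(x**2) squares x inside the formula, so the quadratic regression
# is estimated as an ordinary multiple linear regression.
fit = smf.ols("y ~ x + I(x**2)", data=data).fit()
print(fit.params)
```

The fitted coefficients should land close to the true values 1, 2, and -0.5.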
Why do we care? Because linear regression is much easier to compute than non-linear regression!
Five studies examined the relationship between talent and team performance. Two survey studies found that people believe there is a linear and nearly monotonic relationship between talent and performance: Participants expected that more talent improves performance and that this relationship never turns negative. However, building off research on status conflicts, we predicted that talent facilitates performance—but only up to a point, after which the benefits of more talent decrease and eventually become detrimental as intrateam coordination suffers. (Swaab et al., 2014)
So the authors claim there is no increasing relationship between talent and performance at the top level. That seems plausible, considering e.g. Martin Ødegaard!
They did five studies, as is common in psychology, and we will look at one of the football studies. Have a look at the paper for more details.
url = "https://gist.githubusercontent.com/JonasMoss/ae5436bf951da5b3e723ce6fec39e77f/raw/03148a170130686d95f020b81e27bc14b35705ff/talent.csv"
talent = pd.read_csv(url)
sns.lmplot(x = "talent", y = "points", data = talent)<seaborn.axisgrid.FacetGrid at 0x2d706b03be0>
The data is evidently not linear. So let’s try the logarithmic transform.
talent["log_talent"] = np.log(talent["talent"] + 1)
sns.lmplot(x = "log_talent", y = "points", data = talent)<seaborn.axisgrid.FacetGrid at 0x2d7336c3670>
This looks quite linear!
import statsmodels.formula.api as smf
fit = smf.ols(formula = 'points ~ np.log(talent + 1)', data=talent).fit()
print(fit.summary())

OLS Regression Results
==============================================================================
Dep. Variable: points R-squared: 0.566
Model: OLS Adj. R-squared: 0.564
Method: Least Squares F-statistic: 268.8
Date: Thu, 17 Nov 2022 Prob (F-statistic): 3.30e-39
Time: 15:05:49 Log-Likelihood: -1408.4
No. Observations: 208 AIC: 2821.
Df Residuals: 206 BIC: 2828.
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 273.8867 16.311 16.791 0.000 241.729 306.045
np.log(talent + 1) 247.0227 15.068 16.394 0.000 217.315 276.730
==============================================================================
Omnibus: 25.340 Durbin-Watson: 0.952
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.640
Skew: 0.874 Prob(JB): 2.22e-07
Kurtosis: 3.692 Cond. No. 1.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fitted function is “inverse U-shaped”: it appears that points first increase with talent, then decrease. At least when you look at the fitted curve!
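The quadratic fit summarized below can presumably be reproduced along these lines (assuming the same talent data as above; the exact call is not in the slides):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

url = "https://gist.githubusercontent.com/JonasMoss/ae5436bf951da5b3e723ce6fec39e77f/raw/03148a170130686d95f020b81e27bc14b35705ff/talent.csv"
talent = pd.read_csv(url)

# I(talent**2) adds the squared term, making this a multiple linear regression.
fit2 = smf.ols("points ~ talent + I(talent**2)", data=talent).fit()
print(fit2.summary())
```

The negative coefficient on the squared term is what produces the concave, inverse U-shaped fit.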
OLS Regression Results
==============================================================================
Dep. Variable: points R-squared: 0.464
Model: OLS Adj. R-squared: 0.459
Method: Least Squares F-statistic: 88.87
Date: Thu, 17 Nov 2022 Prob (F-statistic): 1.61e-28
Time: 15:05:49 Log-Likelihood: -1430.3
No. Observations: 208 AIC: 2867.
Df Residuals: 205 BIC: 2877.
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 305.3440 17.627 17.323 0.000 270.591 340.097
talent 54.8979 5.469 10.039 0.000 44.116 65.680
I(talent ** 2) -0.5702 0.075 -7.604 0.000 -0.718 -0.422
==============================================================================
Omnibus: 18.548 Durbin-Watson: 0.654
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.285
Skew: 0.781 Prob(JB): 2.39e-05
Kurtosis: 3.128 Cond. No. 988.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
How can you model non-linear relationships? Not all functions are closely approximated by linear functions, e.g., np.sin(2 * np.pi * x) * x * np.exp(x).
You can use the np.vander function to construct design matrices consisting of polynomials. Fitting polynomials of increasing degree gives the following (degree, AIC) pairs:

[(5, 770.3176318449504),
 (6, 765.6739087918303),
 (7, 722.3531744500278),
 (8, 631.4232378065403),
 (9, 611.365210202308),
 (10, 570.3460457476948),
 (11, 516.8364848836604),
 (12, 518.5146827024718),
 (13, 516.993964244756),
 (14, 518.9040946323105),
 (15, 520.2862547758479)]
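For reference, a minimal np.vander sketch (hypothetical x values): with increasing=True the columns are \(1, x, x^2, \ldots\), exactly a polynomial design matrix.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# increasing=True orders the columns as 1, x, x**2.
X = np.vander(x, 3, increasing=True)
print(X)
```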
Another option is splines.
They are piecewise polynomials glued together.
Just like polynomials, any function can be approximated by splines if the number of degrees of freedom is high enough!
You can create spline bases using the bs function from patsy, e.g., y ~ bs(x, df = 5, degree = 3). The degree corresponds to the degree of the polynomial pieces, with three (cubic splines) being the most common choice.
Details are not on the curriculum – but you should know that splines > polynomials for modeling of arbitrary shapes!
Take a look at this video for more information.
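A minimal sketch of what bs produces (hypothetical x values; patsy's dmatrix evaluates the same basis that the formulas in this section use):

```python
import numpy as np
from patsy import dmatrix

# Cubic B-spline basis with 5 degrees of freedom; each of the 5 columns
# is one basis function evaluated at the points in x.
x = np.linspace(0, 1, 100)
basis = np.asarray(dmatrix("bs(x, df=5, degree=3) - 1", {"x": x}))
print(basis.shape)
```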
We have already modelled this:
Define a fitter function for the desired degrees of freedom:
Optimization terminated successfully.
Current function value: 0.458864
Iterations 10
Logit Regression Results
==============================================================================
Dep. Variable: z No. Observations: 100
Model: Logit Df Residuals: 94
Method: MLE Df Model: 5
Date: Thu, 17 Nov 2022 Pseudo R-squ.: 0.3372
Time: 15:05:50 Log-Likelihood: -45.886
converged: True LL-Null: -69.235
Covariance Type: nonrobust LLR p-value: 6.551e-09
============================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 7.5252 3.288 2.289 0.022 1.081 13.969
bs(x, degree=3, df=5)[0] -12.2394 4.993 -2.451 0.014 -22.026 -2.453
bs(x, degree=3, df=5)[1] -2.2668 2.716 -0.835 0.404 -7.590 3.057
bs(x, degree=3, df=5)[2] -15.3265 4.839 -3.168 0.002 -24.810 -5.843
bs(x, degree=3, df=5)[3] 8.0071 4.657 1.719 0.086 -1.121 17.135
bs(x, degree=3, df=5)[4] -65.6163 20.183 -3.251 0.001 -105.174 -26.058
============================================================================================
Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
def fit_model(df):
    formula = "y ~ bs(x, degree = 3, df = " + str(df) + ")"
    return smf.ols(formula, data = pd.DataFrame({"x": x, "y": y})).fit()
[(df, fit_model(df).aic) for df in range(3, 16)]

[(3, 769.919848597824),
(4, 768.489935931196),
(5, 770.497611464271),
(6, 733.230555729813),
(7, 585.9675275398621),
(8, 573.4227183375275),
(9, 575.0085899189091),
(10, 543.2839496203144),
(11, 519.8409155393351),
(12, 517.3087684885444),
(13, 513.5928894005049),
(14, 515.7660420431214),
(15, 517.8746224632054)]
Let \(x_1, x_2, x_3, \ldots, x_n\) be iid Bernoulli with some unknown \(p\).
We can estimate \(p\) using \(\overline{x}\). Its variance is \(p(1-p)/n\).
But what happens if we need to estimate the odds \(p/(1-p)\) instead of \(p\)? What’s the variance then?
Let \(g\) be a differentiable function and
\[ \sqrt{n}(\hat{\theta}_n -\theta)\stackrel{d}{\to}N(0,\sigma^2) \]
where \(\sigma^2\) is the asymptotic variance. Then
\[ \sqrt{n}(g(\hat{\theta}_n) -g(\theta))\stackrel{d}{\to}N(0,[g'(\theta)]^2\sigma^2) \]
In our case \(g(p)=p/(1-p)\).
Its derivative is \(1/(1-p)^2\). (Found using the quotient rule.)
Recall that the asymptotic variance of \(\overline{x}\) (i.e., variance of \(\sqrt{n}(\overline{x}-p)\) ) is \(\sigma^2 = p(1-p)\).
Thus \(g'(p)^2\sigma^2=p(1-p)/(1-p)^4=p/(1-p)^3\).
Hence the asymptotic variance of the estimated odds \(\overline{x}/(1-\overline{x})\) is \(p/(1-p)^3/n\).
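A quick simulation check of this formula (made-up \(p\) and \(n\), not from the slides):

```python
import numpy as np

# Simulation sketch of the delta method for the odds.
rng = np.random.default_rng(1)
p, n = 0.3, 1000
# 100,000 replications of the sample mean of n Bernoulli(p) draws.
xbar = rng.binomial(n, p, size=100_000) / n
odds = xbar / (1 - xbar)

print(odds.var())             # simulated variance of the estimated odds
print(p / (1 - p) ** 3 / n)   # delta-method approximation p/(1-p)^3/n
```

The two numbers should agree closely, since \(n = 1000\) is large enough for the asymptotics to kick in.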
Suppose \(X\) has variance \(\sigma^2\) and unknown mean \(\mu\). How can we estimate \(\mu^2\) and what is its asymptotic variance?
Suppose \(\hat{\beta}_1\) is a regression coefficient from the logistic model. What is the asymptotic variance of \(\exp(\hat{\beta}_1)\)?
Answer: Recall that the entire parameter vector \(\hat{\beta}\) has covariance matrix equal to the inverse of the Fisher information, \(I^{-1}\); hence the variance of \(\hat{\beta}_i\) is the \(i\)th diagonal element of \(I^{-1}\). Now apply the delta method to \(g(x)=\exp(x)\), yielding \((g'(x))^2 = \exp(2x)\). Hence the asymptotic variance of \(\exp(\hat{\beta}_1)\) is \(\exp(2\beta_1)I^{-1}_{11}\).
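A sketch of this calculation in statsmodels, on simulated data with made-up coefficients; cov_params() supplies the estimated covariance matrix of \(\hat{\beta}\):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate logistic data with made-up coefficients (0.5, 0.8).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p = 1 / (1 + np.exp(-(0.5 + 0.8 * x)))
y = rng.binomial(1, p)
fit = smf.logit("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit(disp=0)

beta1 = fit.params["x"]
var_beta1 = fit.cov_params().loc["x", "x"]       # diagonal element of the estimated covariance
var_exp_beta1 = np.exp(2 * beta1) * var_beta1    # delta method with g(x) = exp(x)
print(np.exp(beta1), var_exp_beta1)
```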
Recall that the gradient of a function \(g:\mathbb{R}^k\to\mathbb{R}\) is \[\nabla g(\theta)=\left[\begin{array}{c} \frac{\partial g}{\partial\theta_{1}}\\ \frac{\partial g}{\partial\theta_{2}}\\ \vdots\\ \frac{\partial g}{\partial\theta_{k}}\end{array}\right]\]
Example. Consider the function \[g(\theta)=\theta_{1}^{2}+\theta_{2}^{2}+\theta_{3}^{2}\]
The gradient is \[\nabla g(\theta)=\left[\begin{array}{c} \frac{\partial g}{\partial\theta_{1}}\\ \frac{\partial g}{\partial\theta_{2}}\\ \frac{\partial g}{\partial\theta_{3}} \end{array}\right]=2\left[\begin{array}{c} \theta_{1}\\ \theta_{2}\\ \theta_{3} \end{array}\right]\]
Assume that \(\theta\) is a vector and \[\sqrt{n}(\hat{\theta}-\theta)\stackrel{d}{\to}N(0,\Sigma).\] The matrix \(\Sigma\) is the asymptotic covariance matrix.
\[\sqrt{n}(g(\hat{\theta})-g(\theta))\stackrel{d}{\to}N(0,\nabla g(\theta)^{T}\Sigma\nabla g(\theta))\]
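A numerical sketch using the example function above, with a made-up estimate and covariance matrix:

```python
import numpy as np

# g(theta) = theta_1^2 + theta_2^2 + theta_3^2, so grad g(theta) = 2 * theta.
theta_hat = np.array([1.0, 2.0, 3.0])
Sigma_hat = np.diag([0.1, 0.2, 0.3])   # hypothetical asymptotic covariance matrix

grad = 2 * theta_hat
var_g = grad @ Sigma_hat @ grad        # nabla g(theta)^T Sigma nabla g(theta)
print(var_g)                           # approximately 14.4
```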
In statsmodels, the estimated covariance matrix of the coefficients is available through cov_params(); since this estimates \(\tau/n\), the asymptotic covariance matrix is estimated by n*model.cov_params().

Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: admit No. Observations: 400
Model: Logit Df Residuals: 394
Method: MLE Df Model: 5
Date: Thu, 17 Nov 2022 Pseudo R-squ.: 0.08292
Time: 15:05:51 Log-Likelihood: -229.26
converged: True LL-Null: -249.99
Covariance Type: nonrobust LLR p-value: 7.578e-08
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -3.9900 1.140 -3.500 0.000 -6.224 -1.756
C(rank)[T.2] -0.6754 0.316 -2.134 0.033 -1.296 -0.055
C(rank)[T.3] -1.3402 0.345 -3.881 0.000 -2.017 -0.663
C(rank)[T.4] -1.5515 0.418 -3.713 0.000 -2.370 -0.733
gre 0.0023 0.001 2.070 0.038 0.000 0.004
gpa 0.8040 0.332 2.423 0.015 0.154 1.454
================================================================================
model.cov_params(). (In our notation, this equals \(\tau/n\).)

              Intercept  C(rank)[T.2]  C(rank)[T.3]  C(rank)[T.4]       gre       gpa
Intercept      1.299488     -0.084476     -0.048644     -0.089431 -0.000301 -0.303660
C(rank)[T.2]  -0.084476      0.100166      0.069566      0.070127 -0.000002  0.004521
C(rank)[T.3]  -0.048644      0.069566      0.119237      0.069742  0.000019 -0.009469
C(rank)[T.4]  -0.089431      0.070127      0.069742      0.174583  0.000012  0.003568
gre           -0.000301     -0.000002      0.000019      0.000012  0.000001 -0.000124
gpa           -0.303660      0.004521     -0.009469      0.003568 -0.000124  0.110104
Let \(y=f(x)\) be the true relationship between \(x\) and \(y\). Then \(\hat{f}(x)\pm w(x)\) is a \(1-\alpha\)-level pointwise confidence band if
\[ P(\hat{f}(x)-w(x)\leq f(x) \leq \hat{f}(x)+w(x)) = 1-\alpha. \]
Usually \(\alpha = 0.1\) or \(\alpha = 0.05\).
This is not a prediction band! We are quantifying the error in the estimation of \(f(x)\), not the error in a prediction.
If the true model is \(y=f(x)+\epsilon\), we would need to take \(\epsilon\) into account when quantifying the error in prediction.
This is an implementation of confidence bands for logistic regression models with one covariate \(x\).
def logistic_conf_bands(formula, x, data):
    """Plot logistic 95% confidence bands for `formula` with x-axis variable
    `x` and data `data`."""
    model = smf.logit(formula, data = data).fit()
    X = model.model.exog
    preds = model.predict()
    F = lambda x: 1/(1 + np.exp(-x))
    f = lambda x: np.exp(-x)/(1 + np.exp(-x)) ** 2
    g = lambda x: x @ model.cov_params() @ x * f(model.params @ x) ** 2
    taus_sqrt = np.sqrt(np.array([g(y) for y in np.array(X)]))
    hats = np.array([F(model.params @ y) for y in np.array(X)])
    df = pd.DataFrame({
        'x': np.array(data[x]),
        'y': preds,
        'ymin': hats - 1.96*taus_sqrt,
        'ymax': hats + 1.96*taus_sqrt})
    ax = df.plot(x = 'x', y = 'y')
    ax.fill_between(x = df['x'], y1 = df['ymax'], y2 = df['ymin'], alpha=.2)
    ax.set_xlim(left=0, right=1)
    ax.set_ylim(bottom=0, top=1)
    ax.get_legend().remove()

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import pandas as pd
predictions = pd.read_csv("data/predictions.csv")
predictions = predictions.sort_values(by = "prediction")
predictions.head()

     prediction  outcome
114        0.03        1
113        0.03        0
112        0.03        0
111        0.04        0
110        0.05        0
formula = "outcome ~ prediction"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.587627
         Iterations 5

139.15420507407748
formula = "outcome ~ prediction + I(prediction**2)"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.584836
         Iterations 5

140.5122461104003
from scipy.special import logit
formula = "outcome ~ logit(prediction)"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.582583
         Iterations 5

137.99401604753055
formula = "outcome ~ prediction + logit(prediction)"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.580761
         Iterations 6

139.57507719177076
formula = "outcome ~ bs(prediction, degree = 3, df = 5)"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.573171
         Iterations 6

143.8294043601309
formula = "outcome ~ np.vander(prediction, 7, increasing=True) - 1"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.563228
         Iterations: 35

143.542416570845
formula = "outcome ~ bs(logit(prediction), degree = 3, df = 5)"
logistic_conf_bands(formula, "prediction", predictions)
plt.show()
smf.logit(formula, data = predictions).fit().aic

Optimization terminated successfully.
         Current function value: 0.566878
         Iterations 7

142.38183686937128
\[X=\left[\begin{array}{cccc}1 & 1 & 1 & 0\\1 & 2 & 4 & 2\\1 & 3 & 9 & 6\\1 & 4 & 16 & 12\\1 & 5 & 25 & 20\end{array}\right]\]
Now perturb a single entry slightly:
\[X=\left[\begin{array}{cccc} 1 & 1 & 1 & 0\\ 1 & 2 & 4 & 2\\ 1 & 3 & 9 & 6\\ 1 & 4 & 16 & 12\\ 1 & 5 & 25 & 20 + 0.000001\\ \end{array}\right]\]
import numpy as np
x = lambda eps: np.array(
[[1,1,1,0],
[1,2,4,2],
[1,3,9,6],
[1,4,16,12],
[1,5,25,20 + eps]]
)
y = x(0)
np.linalg.inv(y @ y.transpose())

LinAlgError: Singular matrix
That’s expected, since we know the matrix is singular. But how about small \(\epsilon\)s?
array([[ 2.19902425e+13, -5.76632937e+13, 4.10484400e+13,
2.93203394e+12, -8.30742271e+12],
[-6.15726735e+13, 1.62239084e+14, -1.17281250e+14,
-5.86406690e+12, 2.24789060e+13],
[ 5.27765728e+13, -1.40737509e+14, 1.05553120e+14,
1.95468794e+06, -1.75921857e+13],
[-8.79609367e+12, 2.54109362e+13, -2.34562481e+13,
5.86406201e+12, 9.77343561e+11],
[-4.39804814e+12, 1.07507823e+13, -5.86406201e+12,
-2.93203101e+12, 2.44335890e+12]])
That worked!
But that didn’t.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 8.700e+29
Date: Thu, 17 Nov 2022 Prob (F-statistic): 1.15e-30
Time: 15:05:53 Log-Likelihood: 144.97
No. Observations: 5 AIC: -283.9
Df Residuals: 2 BIC: -285.1
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 2.1e-13 4.76e+12 0.000 1.000 1.000
x1 1.0000 9.82e-14 1.02e+13 0.000 1.000 1.000
x2 4.0000 3.64e-14 1.1e+14 0.000 4.000 4.000
x3 3.0000 6.19e-14 4.84e+13 0.000 3.000 3.000
==============================================================================
Omnibus: nan Durbin-Watson: 0.208
Prob(Omnibus): nan Jarque-Bera (JB): 0.647
Skew: 0.685 Prob(JB): 0.724
Kurtosis: 1.892 Cond. No. 1.13e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.26e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.848e+13
Date: Thu, 17 Nov 2022 Prob (F-statistic): 1.06e-07
Time: 15:05:53 Log-Likelihood: 54.150
No. Observations: 5 AIC: -100.3
Df Residuals: 1 BIC: -101.9
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 2.98e-05 3.35e+04 0.000 1.000 1.000
x1 2.0000 3168.535 0.001 1.000 -4.03e+04 4.03e+04
x2 3.0000 3168.535 0.001 0.999 -4.03e+04 4.03e+04
x3 4.0000 3168.535 0.001 0.999 -4.03e+04 4.03e+04
==============================================================================
Omnibus: nan Durbin-Watson: 0.355
Prob(Omnibus): nan Jarque-Bera (JB): 0.635
Skew: -0.736 Prob(JB): 0.728
Kurtosis: 2.059 Cond. No. 2.06e+10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.81e-18. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
A perturbation of 0.00000001 changed three coefficient estimates dramatically!
Almost collinear covariate matrices make inference hard and can produce bouncing betas.
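One practical diagnostic (an aside, not from the slides) is the condition number of the design matrix, the ratio of its largest to smallest singular value; huge values warn that OLS estimates may bounce like this:

```python
import numpy as np

# The perturbed design matrix from the example; its fourth column is almost
# exactly the third column minus the second, so it is nearly rank-deficient.
X = np.array([[1, 1, 1, 0],
              [1, 2, 4, 2],
              [1, 3, 9, 6],
              [1, 4, 16, 12],
              [1, 5, 25, 20 + 1e-6]], dtype=float)
print(np.linalg.cond(X))  # astronomically large
```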
Quadratic regression is popular. But it’s usually not a good idea to use it!
Polynomial regression is bad; regression splines are much better.
Any function \(g(x)\) can be used to transform variables in a regression model.